Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: add unreferenced object filtration KEP #1742

Merged
merged 1 commit into from Sep 2, 2021

Conversation

shaneutt
Copy link
Member

Which issue this PR fixes

Resolves #1259

Special notes for your reviewer:

Keep in mind that this is a provisional step and is mostly in place to start gathering alignment and come to a decision.

Maintainers please feel free to make commits directly to this branch with ideas, updates, and fixes to the KEP so long as you maintain the git history.

@shaneutt shaneutt added priority/low area/kep Enhancment Proposals labels Aug 20, 2021
@shaneutt shaneutt self-assigned this Aug 20, 2021
@shaneutt shaneutt temporarily deployed to Configure ci August 20, 2021 19:40 Inactive
@shaneutt shaneutt force-pushed the kep-unreferenced-object-filtration branch from b7c187e to c3b3227 Compare August 20, 2021 19:41
@shaneutt shaneutt marked this pull request as ready for review August 20, 2021 19:41
@shaneutt shaneutt requested a review from a team as a code owner August 20, 2021 19:41
@shaneutt shaneutt temporarily deployed to Configure ci August 20, 2021 19:52 Inactive
Copy link
Member

@mflendrich mflendrich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way I understand this write-up is that we have the following motivations:

(1) we want to reduce logging for events that are not meaningful
(2) we want to save CPU cycles by not rebuilding config when a non-meaningful change happens

The problem with the former is that, at the point of reconciliation of, say, a Service, we don't know whether it's meaningful or not - a few minutes later an Ingress may be created that references it, and relies on the presence of that Service in the cache. Something we can definitely do, though, is suppress logging/processing specifically for Endpoints that don't have a link to a live ingress. But maybe a key to that would be building a completely different (and much simpler) mechanism that tracks endpoint changes on a config generated by us?

On the latter, I'm unclear on the ROI of the variants of implementation that we're talking about here. The unknown ROI, plus the apparent lack of a specific demonstrated performance bottleneck, gives me low confidence that this problem is a real issue.

That said, my thinking is that we should aim for solving (1) and drop (2) for the time being. (1) is a much easier problem because the set of "spammy" resources is quite limited and predictable, and a symptomatic solution may be much simpler and less maintenance-heavy going forward.

Suggestion:

  • rewrite Motivation to only leave - reduce noise in logs for the KIC controller manager
  • perform RCA of why unnecessary log prints make it to the log
  • try a more targetted approach for the above.

@mflendrich
Copy link
Member

Maybe the entire problem got solved by #1755 ?

@shaneutt
Copy link
Member Author

shaneutt commented Aug 26, 2021

The way I understand this write-up is that we have the following motivations:

(1) we want to reduce logging for events that are not meaningful

Suggestion:

* rewrite `Motivation` to only leave `- reduce noise in logs for the KIC controller manager`
* perform RCA of why unnecessary log prints make it to the log

The logging isn't even a component of this any longer after #1755 and I've removed this from the motivation as its no longer a factor.

Maybe the entire problem got solved by #1755 ?

I would argue no, we are still spending compute cycles and cache memory to process and store objects which we wont do anything with, and at scale this waste only becomes more prominent.

(2) we want to save CPU cycles by not rebuilding config when a non-meaningful change happens

The problem with the former is that, at the point of reconciliation of, say, a Service, we don't know whether it's meaningful or not - a few minutes later an Ingress may be created that references it, and relies on the presence of that Service in the cache.

Please check the design ideas section of this KEP: this exact problem is being considered and solutions have already been proposed to solve it. I would welcome some feedback on those solutions, particularly if you think there's any serious problem with them.

plus the apparent lack of a specific demonstrated performance bottleneck, gives me low confidence that this problem is a real issue.

Did you see the drawbacks section? There was awareness of this problem when the KEP was written and its suggested that we may need to consider this KEP declined until such a time where there are real world problems stemming from this.

I'm slightly in favor of doing something (re: the namespace reference solution, which is minimal but very simple and clean) but if you feel strongly we should decline this and wait until we see real world reports I may be persuaded 🤔

@shaneutt shaneutt temporarily deployed to Configure ci August 26, 2021 13:55 Inactive
@shaneutt shaneutt temporarily deployed to Configure ci August 26, 2021 19:41 Inactive
@shaneutt shaneutt temporarily deployed to Configure ci August 27, 2021 17:41 Inactive
@rainest
Copy link
Contributor

rainest commented Aug 27, 2021

API server interactions and memory are the resources I expect to reduce with this--CPU time would also decrease but I'd expect the gain there is negligible. Granted, we can't say for sure that we're not taking too long to generate config within the controller absent user reports or test data, but I'd expect it's fast enough (completes before the next tick) based on a lack of complaints so far and dwarfed by network time to transfer the completed config anyway.

#1447 indicates an API server delay for Secrets specifically (maybe specific to them because of the way they're stored).

Per https://discuss.konghq.com/t/linear-increase-in-ingress-controller-memory-with-increasing-number-of-ingress-routes/8824 the (unremarkable) memory consumption is that more objects equals more memory consumption. It's possible that most consumption is the final product rather than the K8S object cache--both my and that user's testing used test data that rendered all test objects into Kong config--we'd need another test with many objects that only get into the cache but are never used by the parser to see if there's a significant benefit there.

I'll repeat my log statement from earlier: we kinda sidestepped the problem there since those logs should be of interest for objects we do want to ingest, and we just got rid of them entirely. The improvement there if we had this filter would be that we could re-add them.

@shaneutt shaneutt force-pushed the kep-unreferenced-object-filtration branch from 0f84169 to 0ee6c35 Compare August 27, 2021 19:27
@shaneutt
Copy link
Member Author

I'll repeat my log statement from earlier: we kinda sidestepped the problem there since those logs should be of interest for objects we do want to ingest, and we just got rid of them entirely.
The improvement there if we had this filter would be that we could re-add them.

0ee6c35 includes updates to the motivations and goals intended to capture this desire for the logging that would be possible once the filtration is in place.

@shaneutt shaneutt temporarily deployed to Configure ci August 27, 2021 19:37 Inactive
@shaneutt
Copy link
Member Author

shaneutt commented Sep 2, 2021

Given the feedback on this so far I'm inclined to do the following:

  • add a note at the top of the KEP that we've decided to put this on hold as we couldn't find sufficient justification (in the form of reported production issues) for this problem at the time
  • mark the KEP status as declined
  • merge the KEP in declined state for posterity so that contributors know this is something we've considered previously and is something we could pick back up if new data provides priority to do so

@mflendrich / @rainest since you've both commented here, let me know your thoughts on that please? 🤔

@rainest rainest temporarily deployed to Configure ci September 2, 2021 17:53 Inactive
rainest
rainest previously approved these changes Sep 2, 2021
Copy link
Contributor

@rainest rainest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like that's a proposed change to the PR and not the change itself? I approve the changes in the comment, but they don't appear realized yet 😆

Pushing the actual approve button as I trust you will indeed do that.

We probably should have a standing performance thread in discussions, since performance is never really strictly an issue by itself (outside particularly egregious problems), but more a general description of how various factors influence various other factors. Stuff like the findings in #1465 is more contributing to our ever-growing body of knowledge of KIC's performance characteristics; there's never really any "performance is done!" moment.

@shaneutt
Copy link
Member Author

shaneutt commented Sep 2, 2021

@rainest d53bbcf adds a note to the top that this is effectively "frozen" and marks the status as declined.

@shaneutt shaneutt temporarily deployed to Configure ci September 2, 2021 18:37 Inactive
@shaneutt shaneutt merged commit 4974906 into main Sep 2, 2021
@shaneutt shaneutt deleted the kep-unreferenced-object-filtration branch September 2, 2021 19:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KIC 2.0 - Tighten Service/Endpoints Watch
3 participants